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Online Supervised Subspace Tracking 

Yao Xie*, Ruiyang Song, Hanjun Dai, Qingbin Li, Le Song 


Abstract — We present a framework for supervised subspace 
tracking, when there are two time series xt and yt, one 
being the high-dimensional predictors and the other being the 
response variables and the subspace tracking needs to take into 
consideration of both sequences. It extends the classic online 
subspace tracking work which can be viewed as tracking of Xt 
only. Our online sufficient dimensionality reduction (OSDR) is 
a meta-algorithm that can be applied to various cases including 
linear regression, logistic regression, multiple linear regression, 
multinomial logistic regression, support vector machine, the 
random dot product model and the multi-scale union-of-subspace 
model. OSDR reduces data-dimensionality on-the-fly with low- 
computational complexity and it can also handle missing data 
and dynamic data. OSDR uses an alternating minimization 
scheme and updates the subspace via gradient descent on the 
Grassmannian manifold. The subspace update can be performed 
efficiently utilizing the fact that the Grassmannian gradient with 
respect to the subspace in many settings is rank-one (or low-rank 
in certain cases). The optimization problem for OSDR is non- 
convex and hard to analyze in general; we provide convergence 
analysis of OSDR in a simple linear regression setting. The 
good performance of OSDR compared with the conventional 
unsupervised subspace tracking are demonstrated via numerical 
examples on simulated and real data. 

Index Terms — Subspace tracking, online learning, dimension¬ 
ality reduction, missing data. 


I. Introduction 

Subspace tracking plays an important role in various signal 
and data processing problems, including blind source separation 
0, dictionary learning Q, Q, online principal component 
analysis (PCA) 0, 0, imputing missing data Q, denoising 
and dimensionality reduction 0 - To motivate online 
supervised subspace tracking, we consider online dimen¬ 
sionality reduction. Applications such as the Kinect system 
generate data that are high-resolution 3D frames of dimension 
1280 by 960 at a rate of 12 frames per second. At such a 
high rate, it is desirable to perform certain dimensionality 
reduction on-the-fly rather than storing the complete data. In 
the unsupervised setting, dimensionality reduction is achieved 
by PCA, which projects the data using dominant eigenspace 
of the data covariance matrix. However, in many signal and 
data processing problems, side information is available in the 
form of labels or tasks. For instance, the data generated by the 
Kinect system contains the gesture information (e.g. sitting, 
standing, etc.) ®, 0. A supervised dimensionality reduction 
may take advantage of the side information in the choice of 
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Fig. 1: Example to illustrate the difference of unsupervised and 
supervised subspace tracking, (a): a case where dimensionality 
reduction along largest eigenvector is optimal and both supervised 
and unsupervised dimensionality reduction pick the dominant eigen¬ 
vector; (b); a case where dimensionality reduction along the second 
largest eigenvector is optimal. Unsupervised dimensionality reduction 
erroneously picks the largest eigenvector since it only maximizes the 
variance of the predictor variable, but the supervised dimensionality 
reduction, by considering the response variable, correctly picks the 
second largest eigenvector. 

the subspaces for dimensionality rednction. The supervised 
dimensionality reduction is a bit more involved as it has two 
objectives: making a choice of the subspace that represents 
the predictor vector and choosing parameters for the model 
that relates the predictor and response variables. 

Existing online snbspace tracking research has largely 
focused on unsnpervised learning, including the GROUSE 
algorithm (based on online gradient descent on the Grass¬ 
mannian manifolds) 0, 1^, |[TT|, the PETRELS algorithm 
and the MOUSSE algorith m p3) . Local convergence of 
GROUSE has been shown in ]11| in terms of the expected 
principle angle between the true and the estimated subspaces. 
A preliminary exploration for supervised subspace tracking is 
reported in fT?) , which performs dimensionality reduction on 
the predictor vector without considering the response variable 
in the formulation. What can go wrong if we perform subspace 
tracking using only the predictor {xt} but ignoring the response 
variable {yt} (e.g., the approach in |14))? Eig. 0and Eig. 
demonstrate instances in classification, where unsupervised 
dimensionality reduction using a subspace may completely fail 
while the supervised dimension reduction does it right. 

In this paper, we present a general framework for supervised 
subspace tracking which we refer to as the online supervised 
dimensionality reduction (OSDR), which is a meta-algorithm 
that may be applied to various models. OSDR simultaneously 
learns the snbspace and a predictive model throngh alternat¬ 
ing minimization, and the formulation of OSDR takes into 





(b) (c) 


Fig. 2: Simulated data correspond to the case in Figj^b): (a): three- 
dimensional data cloud D = 3 with two classes; (b): scatter plot 
of the data projected by unsupervised subspace tracking into a two- 
dimensional space d = 2; the two classes are completely overlapped 
in the projected space; (c): scatter plot of the data projected by 
supervised subspace tracking into a two-dimensional space d = 2; 
the two classes are well-separately. More details for this example can 
be found in Section [V| 


consideration both the high-dimensional predictor sequence 
{xt} and the response sequence {t/*}. We explore different 
forms of OSDR under various settings, with the loss function 
being induced by linear regression, logistic regression, multiple 
linear regression, multinomial logistic regression, support 
vector machine, random dot-product model, and union-of- 
subspaces model, respectively. A common stmcture is that the 
Grassmannian gradient of the cost function with respect to the 
subspace U is typically rank-one or low-rank (e.g., rank-A: for 
the fc-classihcation problem, or the rank being dependent on 
the number of samples used for a mini-batch update). This 
structure enables us to develop a simple and efficient update 
for U along the geodesic. Due to the orthogonality requirement 
and bilinearity, the optimization problem involved in OSDR is 
non-convex. We provide convergence analysis for OSDR in a 
simple linear regression setting. Good performance of OSDR 
is demonstrated on simulated and real data. 

A notion in statistics related to our problem is sufficient 
dimensionality reduction, which combines the idea of dimen¬ 
sionality reduction with the concept of sufficiency. Given 
a response variable y, a D-dimensional predictor vector x, 
a dimensionality reduction statistic R{x) is sufficient if the 
distribution of y conditioned on R{x) is the same as that of y 
conditioned on x. In other words, in the case of sufficiency, 
no information about the regression is lost by redncing the 
dimension of x. Classic sufficient dimensionality reduction 
methods include the sliced inverse regression (SIR) ng. 


which nses the inverse regression curve, E[a;| 2 /] to perform 
a weighted principle component analysis; more recent works 
1161 use likelihood-based sufficient dimensionality reduction in 
estimating the central subspace. From this perspective, OSDR 
can be viewed as aiming at sufficiency for online snbspace 
based dimensionality reduction. In the offline setting, a notable 
work is sufficient dimension reduction on manifold GZ) which 
considers the problem of discovering a manifold that best 
preserves information relevant to a non-linear regression using 
a convex optimization formulation. 

The rest of the paper is organized as follows. Section [H] sets 
up the formalization and the meta-algorithm form for OSDR. 
Section III presents specihc OSDR algorithms under various 
settings. Section IV includes theoretical analysis. Section |V] 
contains numerical examples based on simulated and real data. 

The notation in this paper is standard: R-|_ denotes the 
set of positive real numbers; H = {1,2, ...,n}; (a;)+ = 
maxja:, 0} for any scalar x', [x]j denotes the jth element 
of a vector x; I{e} is the indicator function for an event e; 
||a;||i and ||x|| denote the ii norm and £2 norm of a vector x, 
respectively; X'^ denotes transpose of a vector or matrix X 
and ||-V|j denotes the Frobenius norm of a matrix AT; is the 
identity matrix of dimension d-hy-d. Dehne the sign function 
sgn(x) is equal to 1 if x > 0 and is equal to 0 if x < 0. 


II. OSDR; THE META-ALGORITHM 

Assume two time-series (xt, yt), t = 1,2, .. such that yt 
can be predicted from the high-dimensional vector xt G R^. 
Here yt G y can be either binary, real, or vector-valued. To 
reduce the data dimensionality, we project xt by a subspace 
Ut G R^^'^ to Xt, with d D, to obtain a projected vector (or 
feature) Uj x*. Here d is the number of principle components 
we will use to explain the response series. Ideally, we would 
like to choose an Ut that maximizes the prediction power 
of UjXt for yt, and Ut can be time-varying to be data- 
adaptive. With such a goal, we formulate the online supervised 
dimensionality reduction (OSDR), which simultaneously learns 
the subspace Ut and a function that relates Xt and yt, by 
minimizing a cost function which measures the quality of the 
projection in terms of predictive power. The optimization prob¬ 
lem of minimizing the loss function is inherently non-convex, 
due to the orthogonality requirement for Up. UjUt = Id, as 
well as bilinear terms arising from coupling of Ut and the 
model parameters. Hence, we develop the algorithm based on 
an alternating minimization and stochastic gradient descent 
scheme. 

We consider two related formulations: the d-formulation, 
which assumes the response function only depends on the 
projected components and hence is a compact model that 
fits into the goal of dimensionality reduction. However, the 
d-formulation cannot deal with missing data. Then we also 
introduce the D-formulation, which handles missing data or can 
also be used for denoising. The D-formulation estimates the 
projection /3 of the data using the subspace from the previous 
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step, and then uses j3 to update the subspace; however, it 
requires us to store high-dimensional model parameters. The 
loss function for the d- and the D-formulations are different, 
but the Grassmannian gradient with respect to U is often 
rank-one (or low-rank). Such simple structure enables us to 
derive efficient algorithm to update U along the geodesic, as 
summarized in Algorithmic In the following derivations, we 
omit the time indices t for notational simplicity. 

A. d-formulation 

The d-formulation is motived by sliced inverse regression, 
which assumes that the response variable y depends only on the 
projected components. The loss function for the d-formulation 
takes the form of 

po X 3^ — y M, 

which measures the predictive quality of the projection for the 
response y: pe{U'^x,y) with some parameter 6. To compute 
the gradient of pg with respect to U on the Grassmannian 
manifold, we follow the program developed in m- First 
compute a partial derivative of pg with respect to U. Let the 
partial derivative of the pg function with respect to the first 
argument be denoted as 

gi = pe{U'^x,y) G K"*. 

By the chain rule, we have the partial derivative with respective 
to U is given by 



Using equation (2.70) in H), we can calculate the gradient 
on the Grassmannian from this partial derivative 

Vp, = {I-uu^)^ = {I-UU^)xgJ. 

In many problems that we describe in Section [Inj the gradient 
gi is one term or a sum a few terms. Hence, the gradient has 
the desired low-rank structure. 


where the partial derivative of with respect to the first 
arguement is given by 

52 = g^>{,Ul3,y) G K^. 

Again, g 2 is often only one term or a sum of a few terms, and 
hence has the desired low-rank structure. 

To estimate /3, for the denoising setting, we may use 

fi = dxguux\\\x — U z\\ = X. (1) 

Z 

When there is missing data, we are not able to observe the 
complete data vector xt, but are only able to observe a subset 
of entries Ut C |i7]. Using an approach similar to that in Q, 
/3 is estimated as 

/? = argmin||Ant(a; - (7z)||, (2) 

Z 

where is an n x n diagonal matrix which has 1 in the jth 
diagonal entry if j G fit and has 0 otherwise. It can be shown 
p = [UIJJq)~^U'^xq,, where Uq, = AqU and xq = Aqx. 

Algorithm 1 OSDR; meta-algorithm 

Require: a sequence of predictors and responses {xt,yt), 
initial model parameter 9 and subspace U G 
1 : for f = 1, 2 ,... do 
2: {d-formulation} 

A ^ — UW)xpg(U'^X, y) (p : X Y —>■ K} 

3: (D-formulationj 

A ^ - UW)gg{UP, y)P, {g:R^ xY -yR, 

P estimated from x} 

4: fix 9 or update U along Grassmannian gradient A 

in geodesic 

5: hx U, update 9 or d 

6: end for 


III. OSDR FOR SPECIFIC MODELS 
In the section, we illustrate various forms of loss functions 
and show that the Grassmannian gradient with respect to U 
typically takes the form of ^rw'^, for some scalar 7 G K, 
vectors r G R^, and w G 


B. D-fonnulation 

The D-formulation assumes that the loss function is defined 
in the ambient dimension space: 

gg : X 3^ —>• K. 

This setting is particularly useful for denoising and imputing 
the missing data, where we will assume the signal x lies near 
a low-dimensional subspace, and estimate a low-dimensional 
component UP and use it to predict y. The loss function takes 
the form of gg{UP, y). Following a similar derivation as above, 
the gradient on the Grassmannian can be written as 

Vgg = {I-UU^)g2P\ 


A. Linear regression 

For linear regression, 5 G M and the loss function will be the 
^ 2 -norm of prediction error. In the d-formulation, 9 = (a, b) 
with a gR'^ and b gR, and the loss function is 

peilf^x^y) = \\y - a'^U'^x -b\\. 

Dehne 

y = a'^U'^x + b. (3) 

The Grassmannian gradient of the loss function with respect 
to U is given by 

Wpg{U^x, y)^-{y-y){I- UW)x^ 
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Using the rank-one structure of the gradient above, we perform 
geodesic gradient descent for U using a simple form. Write 

Vpg{U'^x,y) 

= [»'/lkll V 2 ... diag(cr) [w/||u;|| Z 2 ZdY , 

where 

0 " = -{y - y)\\r\\\\w\\, 


V 2 , ■ ■ ■ ,Vd are an orthonormal set orthogonal to r and 
Z 2 ,...,Zd are an orthonormal set orthogonal to w. Subse¬ 
quently, using the formula in flS) update of U is given by 


TT TT (cosfcrry) - 1) , 

Unev, = u + - --rr^- -UwW^ + Sin{ar])- 

wr 


(4) 


where 77 > 0 is a step-size. Similarly, for a fixed U, we may 
find its gradient with respect to the regression coefficient vector 
and update via 


anew = a + p{y- y)U'^x, 
bnew = b + p.{y-y), 


where /i > 0 is step-size for the parameter update. 

In the Z?-formualtion, the model parameters are 1 ? = (c, e), 
with c S and d S K. Essentially, by replacing x by its 
estimate C7/3, we have the loss function 


QAUp,y) = \\y-c^up-e\\l. 


where j3 is estimated using the subspace from the previous 
step using Q or 0 - Let y = c^U/3, we can show 

Vgd{x,y) = -{y- y) {I - UU'^)a ^^, 

r 

and hence the subspace can be updated similarly, and the 
model parameters are updated via 

Cne^ = c + p{y-y)U/3, 

Snew = e + p(y- y). 


Remark 1 (Difference from unsupervised tracking). Due to 
an alternative formulation, update in OSDR differs from that 
in the unsupervised version & in that the update in 
OSDR depends on y — y. This is intuition since the amount 
of update we have on the subspace should be driven by the 
prediction accuracy for the response variable. 


Remark 2 (Mini-batch update). Instead of updating with every 
single new sample, we may also perform an update with a batch 
of samples. The Grassmannian gradient for this mini-batch 
update scheme can be derived as 

1 . ^ 

S7pg{x,y) = - — ^ (y,-?/*)(/-C/(7T)xjaT. 
i—t — B 

In this case the gradient is no longer rank-one. We may use 
a rank-one approximation of this gradient, or use the exact 
rank-B update described in m which requires computing 
an SVD of this gradient. 


Remark 3 (Computational complexity). The computational 
complexity of OSDR is quite low and it is 0{Dd). The most 
expensive step is to compute r = (I — UU'^)x or r = [I — 
UU'^)a, in the d- and D-fonnulations, respectively. This term 
can be computed as, for instance x — U(W^x), to have a lower 
complexity 0{Dd) (otherwise the complexity is 0{D^d)). 


B. Logistic regression 


For logistic regression, y G {0,1}. Define the sigmoid 
function 


The loss function for logistic regression corresponds to the 
negative log-likelihood function assuming y is a Bernoulli 
random variable. For the d-formulation, the loss function is 
given by 

pe{U'^x, y) = y\ogh{ff LT^x-\-b)-\-{l—y) \og{l—h{of LT^x-\-b)), 

and the model parameter 9 = (a, b) with a S and 6 G K. 
For the D-formulation with a parameter estimate /?, we have 


Qt>{Uffy) = y\ogh{c^Uld+e) + {l-y)\og{l-h{c^U(3+e)) 

and the model parameter d = (c, e) with c G and e G M. 
It can be shown that, in the logistic regression setting, the 
Grassmannian gradients with respect to U, for these two cases 
are given by 


S7pe = -{y-y){I-UU^)x^ 

where the predicted response 

y = h{afU'^x -\- b), 


(7) 


or 

Vy^ = -(y-y)(/-C/f/T)c , 

r luT 

where the predicted response 

y=h(c'^UP + b). ( 8 ) 

Note that the gradients for linear regression and logistic 
regression take a similar form, with the only difference being 
how the response is predicted: in linear regression it is defined 
linearly as in 0 - and in logistic regression it is defined 
through the sigmoid function as in (|^. Hence, OSDR for 
linear regression and logistic regression take similar forms and 
only differs by what response function is used, as shown in 
Algorithm 


C. Multiple linear regression 

We may also extend OSDR to multiple linear regression, 
where y G for some integer m. The loss functions for the 
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Algorithm 2 OSDR for linear and logistic regressions 

Input: yt and Xt (missing data, given Xt observed on an index 
set fit), t = 1,2,.... Initial values for U, a and b (or c 
and d). Step-sizes rj and /r. h can be linear or sigmoid 
function. 

1 : for f = 1,2,... do 

2: {d-formulation} 

y •(— h{a^U'^x -I- 5), r •<— (J — UU'^)x, 

— {y-y)\\r\\\\a\\ 

3: {U-formulation} 

/3 ^ {UlUnr^Ulxn 
y h{c^U /3) -b d, r ^ (/ — UU'^)c, 

-(2/-y)lk||||/3|| 

4: {update subspace} 

U^U+ + Sin(a7?)|;^|^ 

5: {update regression coefficients and residuals} 

a ^ a + y.{y - y)U 13, b ^ b + y.{y - y) 

6 : end for 


Algorithm 3 OSDR for multiple linear regression 

Input: yt and xt (missing data, given Xt observed on an index 
set Ot), t = 1,2,.... Initial values for U, A and b (or C 
and d). Step-sizes rj and fi. 

1 : for f = 1,2,... do 
2: {d-formulation} 

y ^ Am^x + b,r^{I- UW)x, 

w <— {y — y^A'^, a i -|k||||u>|| 

3: {H-formulation} 

P ^ {UlUn)-^Ulxn 

y ^ cm 13 + d,r^{I- Um)C{y - y), 

w = j3, a i -||r|| lircll 

4: {update subspace} 

U^U+ V ww^ + sm{ar]) ^ 

5: {update regression coefficients and residuals} 

A ^ A + yU'^x{y — yY {d-formulation} 

C ^ C + yUf3{y — yY {D-formulation} 

6 : end for 


d- and O-formulations are given by 

pe{mx,y) = \\y - A'^mxWl, 
gYUYy) = \\y-cmp\\l 

with 6 = A G and -d = C G Here we 

assume the slope parameter is zero and this can be achieved 
by subtracting the means from the predictor vector and the 
response variable, respectively. It can be shown that 


given by 


K-2 


pe{mx,y) = - XI % = j}log 
-l{y = K- l}log 


x+bk 


1 + Ef=o' 






- {I - UW)x{y - vYA'^j 

r w'^ 


y = A^mx, 


In this case, the Grassmannian gradient will no longer be 
rank-one but rank-AT, with 


and 


yQt} = - {I-UUYC{y-y) , y = CmY 

r 

It can also be shown that the partial derivative of pg with 
respect to A for a fixed U is given by —W^xijj — yY, and 
the partial derivative of Qg with respect to C for a fixed U is 
given by —U(3{y — yY■ OSDR for multiple linear regression 
is given in Algorithmic 


Vpe = -(/-(7[/T)S, 


and 


K-2 


^=Y.^{y = k} 

A.-0 

-bl{y = a: - 1} 


afc/3^ - 


1 


K-l 


1 _|_ (,alU0+bk 




up+bi 


aiP'' 




1 + Efo 


1=0 

1=0 

(9) 


D. Multinomial logistic regression 


Note that S consists of a sum of K terms and, hence, is usually 
rank-A'. We no longer have the simple expression to calculate 
update of U along the geodesic and the precise update requires 
performing a (reduced) singular value decomposition of the 
gradient 


Multinomial logistic regression means that y G 
{0,1,..., AT — 1} for some integer A' is a categorical random 
variable and it is useful for classification. In the following, we 
focus on the H-formulation; the d-formulation can be derived 
similarly. The loss function is the negative likelihood function 


where E G is a diagonal matrix with the diagonal 

entries being the singular values. Using Theorem 2.3 in | fT8| , 
we update U as 


U= [UQ 


P] 


cos(E?7) 

sin(E? 7 ) 


QY 


( 10 ) 
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where 77 > 0 is the step-size. Alternatively, we may use the 
rank-one approximation to the Grassmannian gradient to derive, 
again, a simple update, which is given by Algorithm 


Algorithm 4 OSDR for if-multinomial logistic regression 

Input: yt and Xt (missing data, given xt observed on an index 
set fit), t = 1, 2,.... Initial values for U, ak- Step-sizes 
rj and y. 

1 : for f = 1,2,... do 

2 : {D-formulation} 

/3 ^ (U^Unr^U^xn 

3: {predict response} 

ykOce<^l^+^>‘, k = l,...,K -1 

Vk ^ yk/{i + T.fji^ yi) 


4: 

5: 


yx ^ 1- yi^ 

compute S using (|^ 

hnd the dominant singular value a, corresponding left 
singular vector p and right singular vector r for (/ — 

UW)T. 

{update subspace} 
p <— Ur 

U^U+ Urr^ + siniar,) ^ ^ 

{Update regression coefficients} 


1 +E 


-l{y = K}- 


'lufi + bk 




Ofe Ufc -f p,hUI3, k = 1,..., K — 1 


bk bk + /i/i, 

end for 


k = 1, 


.K-l 


E. Support vector machine (SVM) 

The loss function for SVM is the hinge loss. For the d- and 
G-formulations, the loss functions are 

pg(U'^ X, y) = max{0,1 — ya^Wx}, 

where 0 = a S and 

Qt)(.U/3,y) = max{0,1 - yc'^U^}, 

where 9 = c G K^. Note that the loss function is not 
differentiable. We may use its sub-gradient to perform gradient 
descent, or find a smooth surrogate function to approximate 
the hinge loss. The Grassmannian sub-gradients for the two 
loss functions are 

Vpe = ^[sgn(ya'''17'''a; -f 1) -f 1] (/ — UW)x ad , 

2 s-- 

and 

= |[sgn(yc'rC/^ + 1) + 1] (I -UU'^)c f3'^ . 

r 

These gradients are again rank-one and, hence, we may update 
U along geodesic efficiently. 


E Random dot product graph model 

The random dot product graph model is useful for relational 
data which usually occurs in social network study 119|- 0 - 
The model assumes that each node is associated with a 
feature Pi, and an edge between two nodes are formed with 
a probability proportional to the inner product between their 
features P'j Pj. Suppose at each time t, we observe a pair of 
nodes in the network with predictor vectors xi^t and X 2 ,t as 
well as their relation indicator variable yt G { 0 , 1 } (i.e., an 
edge is formed or not). We assume a logistic regression model 
P(yt = 1 ) = h{atP\ tP 2 ,t + bt) for some feature vectors Pi,t 
and P 2 ,t that are projections of xi t and X 2 ,t- Here our goal 
is to choose a proper subspace that hts the data nicely. 

Note that given xi and X 2 , the inner product can be estimated 
as x\UU'^X 2 , which involves a quadratic term in U (rather 
than linear in U as in other cases). To be able to obtain the rank- 
one stmcture, we use a two-step strategy: first fix P 2 = U'^X 2 
and update U, and then fix Pi = W^xi and update U. The 
log-likelihood function for hxed P 2 = U'^X 2 is given by 

gs{UP2,y) = y log h{axlUP2 + b) 

+ {l-y) log(l - h{axlUP 2 + b)). 


Similar to logistic regression, 

= {y — h{ax\UP 2 + b)) (/ — UW)axi /?}, 

r 

which is rank-one and we may update the subspace similarly. 
In the second step, the log-likelihood function for fixed Pi = 
W^Xi is given by 

Qt){UPi,y) = y log h{aPlU'^X 2 + b) 

+ (1 - y) log(l - h{aPl Wx2 + b)), 

and the subspace U can be updated similarly. Finally, we hx 
U (and hence fix Pi = W^xi and P 2 = U'^X 2 ) and update the 
logistic regression parameters as 

a„ew = a + p{y - h{aPlP 2 + 6))/3}^2, 

('new = b + p{y- h{aPlP 2 + b)). 


Description of the complete algorithm is omitted here as it is 
similar to the case of logistic regression. 


G. Union-of-subspaces model 


Union-of-subspaces model |22| and multi-scale union-of- 
subspace model p3|, |231, p4) have been used to approximate 


manifold structure of the state. As illustrated in Fig. the 
multi-scale union-of-subspace is a tree of subsets defined on 
low-dimensional affine spaces k)&At with each of 

these subsets lying on a low-dimensional hyperplane with 
dimension d and is parameterized by 

Sj,k,t = {/3 G '■ V = Upk,tP + 

P^k-lt^<l, PgR'^}, 
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where j G {1,..., Jj} denotes the scale or level of the subset 
in the tree, Jt is the tree depth at time t, and k G {1,... ,2^} 
denotes the index of the subset for that level. The matrix 
Uj,k,t G is the subspace basis, and is the 

offset of the subset from the origin. The diagonal matrix 


A 


j,k,t 


= diag{A 


( 1 ) 




T)dxd 


with A^'^^ t ^ 0’ contains eigenvalues of the 

covariance matrix of the projected data onto each hyperplane. 
This parameter specifies the shape of the ellipsoid by capturing 
the spread of the data within the hyperplanes. 

We may couple the subspace tree with regression model, by 
attaching a set of regression coefficients {aj^k,t,bj^k,t} with 
each subset. Given x, we may find a subset in the union that 
has the smallest affinity, and use that subset to estimate /3 by 
projection onto the associated subspace and use the associated 
(linear or logistic) regression coefficients to predict y. The 
affinity can be a distance to the subset similar to what has 
been used for discriminate analysis or in HD, 

= argminmin(x - Uj^k,tw )^“ Uj^k,tw), 

j,k w 


or simply the distance to a subspace 


(j*, k*) = argminmin ||x - Uj^k,tw\\. 

j,k w 

Then we predict the local coefficient associated with that subset. 
OSDR can be derived for these models by combining a step 
of finding the subset with the smallest affinity with subsequent 
subspace and parameter update similar to the linear or logistic 
regression OSDR. 

We may also use this model together with the random dot 
product graph model for social networks, where two users 
may belong to two different subsets in the tree and their 
interaction is determined by the regression model associated 
with their common parent in the tree. This may capture the 
notion of community: each node represents one community 
and there is a logistic regression model for each community. 
The probability that two users interact is determined by the 
“smallest” community that they are both in. In this case, OSDR 
will be two-stage: classification based on affinity function 
followed by a two-step subspace update similar to the OSDR 
for the random dot product model. Section V-C presents one 
such example for illustration. 


IV. Theoretical analysis 

The general OSDR problem is hard to analyze due to its 
non-convexity and bilinearlity in U and the model parameters. 
In this section, we study the optimization problem for linear 
regression with the U-formulation to obtain some theoretical 
insights. In the linear regression case, the loss function is the 
£2 norm g^{U/3,y) = {y — c^UPY. When there is no missing 
data, the projection coefficient is given by ^ = V^x. In a 
simplified setting, assume the response variable is generated 




Fig. 3: Multi-scale union-of-subspaces model (24j . Data are bi- 
partitioned iteratively to form nested subsets, and a low-dimensional 
affine space (with offset from the origin) is fitted for data in each 
subset. The bi-partitioning creates a binary tree with multi-scale 
representation of the data: nodes of the tree contain parameters for 
that subset, and leaves represent the union-of-subspaces used at time t. 
The lower panel illustrates the structure of these subspaces. The black 
dots represent historical data Xt used to estimate the subspace. This 
model can be used in conjunction with, for example, a random dot 
product graph to model social networks: at each time t, two persons 
with features and /32 from two subsets interact with probability 
proportional to the inner product f5'[P2- 


with the parameter c: y = c^x. Then the loss function is given 
by 

Qa{up,y) = UU'^)xf. 


A. Fixed-point with respect to U 

First, we show that for a fixed model parameter, the optimiza¬ 
tion problem with respect to the subspace U will converge to an 
orthonormal matrix even without the orthogonality constraint. 
We make a simplifying assumption that the true response 
is linear with parameter equal to the assumed parameter 
c: y = c'^x, then we have that one step in the alternating 
minimization can be written as 

minimize (c^ (I — ULT^ixY 

u ( 11 ) 

subject to U'^U = Id- 

This problem is non-convex due to the constraints as well as 
the quadratic term UIF^ in the objective function. Without 
loss of generality, we may assume ||c ||2 = 1. Construct a 
matrix Cq with c being its first column. Then we consider the 
following optimization problem related to ( [TT] l that will help 
us to establish properties later. 

minimize L{U,Co) ^ E\\C^,{I - UUr)x\\l 
subject to WU = Id- 

Theorem 1. Given a fixed orthogonal matrix Cq, the station¬ 
ary point U* to the optimization problem without the 
constraint: 

minmize EIICq (I — (7C/''')a;||2, (13) 
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are orthogonal matrices of size D x d whose columns are d 
largest eigenvectors of the covariance matrix X of data x. 
Assume X has d distinct dominant eigenvalues. 


We need the following lemma for the proof. 


Lemma 1 ( 1251). Let Co G he a positive semi-definite 

matrix and U G For a function J : i—>• K defined 


use (j)i{Ut,U*), the principal angles between Ut and the true 
subspace U*, which is defined as 

cosfifiUt,U*)=afiU*^Ut), iGldj 

as a metric. Further define 

d 

et = 51 sin" UU*,Ut) = d- \\U*^Ut\\%. 

i=l 


J{U) = tr(Co) - 2tr(C/TCoC/) + tr{U'^CoU ■ WU), 
the gradient of J is 

XJ{U) = -2[2Co - CoC/C/T - UW^Cop. 

U* is a stationary point of J{U) if and only if U* = UdQ, 
where Ud G contains any d distinct eigenvectors of C 

and Q G is an arbitrary orthonormal matrix. Moreover, 

all stationary points of J{U) are saddle points except when 
Ud contains the d dominant eigenvectors of Cq, in which case 
J{U) attains the global minimum at U*. 

Proof of Theorem Let X denote the covariance matrix of x\ 
X = and let C = CqCq. We may write the objective 

function as 

L{U,Co) = E[tr{CpI - UU'^)xx'^ {I - UW^po)] 

= ti{(X - XUU^ - UWX + UWXUW)C) 

= -2[C(/ - UW)X + X{I - UU'^C)p. 

Then the partial derivative of L{U,Co) with respect to U is 
then given by 

dLpPo) dpiXUU'^C)) dppU'^XC)) 
dU “ dC dU 

ppU^XUU^C)) 

= -2{CUWX + XUWC -CX - xcp. 

If we choose the columns of Cq properly that Cq = 
{c,C 2 ,... ,cd) is orthonormal, we have C = CqCq = Id, 
and thus, 

^ + X{I - C/C/T)]C/. 

With the equation above together with Lemma [T] we have that 
the only stationary points of the optimization problem U* are 
d distinct dominant eigenvectors of the matrix X (assuming 
X is full-rank). 

□ 


B. Convergence 

In the same setting, if we fix the model parameter, we 
may establish the following local convergence property with 
respect to the Grassmannian gradient of U. Suppose the case 
where x is exactly on the subspace U*, and x = U*s. We 


Note that when there is no missing data, p = Ufi, r = {I — 
UW)a, and 

r'^p = 0. 


Typically we can assume yt — ytfi^ 0. Hence, we can choose 
a set of step-sizes pt > 0 properly such that 

Ikf = 11^11"-Ibf = pf-||/3f. (14) 


Define 9t such that 


cosOt = 


M 

ibtir 


sin 9t 


M 

ibtir 


If such pt > 0 exists, we may choose the constant c* = 1; 
otherwise we may choose ct accordingly to satisfy ( [l4| . When 
there is no missing data, it is easy to check that Lemma 3.1 
in GD still applies: if we choose the step size ry such that 
Vt = dj/ut, then we have 


et - et+i = (1 


PJ At At Pt 

PjPt ’ 


(15) 


where At = Uj U*. 


Next, we establish conditions for the alternating minimiza¬ 
tion used in OSDR to converge. The following lemma from 
1261 comes handy. 


Lemma 2 (Theorem 4 in ||26)). Let {A4,d) be a compact 
metric space. Given two sets V,Q G Xi, define the Hausdorff 
distance between them as 


dniV, Q) = max |suppgpinfQgQd(P, Q), 

supQ6Qinfpg-pd(P, (5)}- 

Let {{Vn, Qn)}n>o, 'P, Q be compact subsets of the compact 
metric space (A4, d) such that 


Vr, 


dn , 


r,Qr, 


dn, 


Q 


and let i : X4 X X4 ^ M. be a continuous function. If there 
exists a function <5 : x Ad —>■ K, and the following two 

conditions hold: 

(a) for all n > P G Vn, Q G Qn-i, P = 
argminpg-p^ £(P, Q) 

S{P,P)+i{P,Q)<eiP,Q), 

(b) for all n > 1, P,P G Vn, Q G Qn, Q = 
argminQgQ^ £{P,Q), 

£iP,Q)<mQ) + SiP,P), 
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then the adaptive alternating minimization algorithm C. e-net 


P* G arg mm e{P,Qn-i), 

Q*n e i{Pu,Q), 

with {PqtQo) initialization converges. 


Lemma can be applied to our setting in the following 
sense. Let q = {I — UU'^)x and P = cc'^, we have that our 
optimization problem 


minimize {c^ {I — UU'^)x)^ 

U,c 

subject to WU = Id 

yt = c'^xt, 


can be recast into 

minimize l{P,q) = q^Pq 

U,c 

subject to P G Vt, 

Q S Qt, 


where 


g, = {z G : 3L G VW = Id,z= (I-VV^)xt}, 


and 

Pt = {Z G : 3?; G = v'^Xt, Z = vv'^}. 

For simplicity, denote Pn,Pn and qn,qn be arbitrary items 
from Vn and g„, respectively, and 

Pq^n = arg^m i{P,q), 

and 

qp,n = arg min i{P, q). 
qeQn 

Suppose we can choose the data such that limt_>oo Xt = p 
and yt should also converge to some v, then there exists P, 
g such that Vt P and Qt Q. In order to be able to 
choose such that 

<5(P„,Pn) > -^{Pn,qn), 

^{Pn: P{qj^_i,n}^ — ^(Pnj^n—l) 5 l)i 

we will need to have 


^{Pniq-a—l) ^^P{qn-i,n}iqn—l) 

^I{Pnj q{P{q^_i,n},n}') ^{Pni qn); 

then we will need 

^{Pm qn—l) ^{P{qn-i,n} t 9n—l) n St 

> f(P„,g{p^.^_^ - f(P„,(?{p„,„}). '' 

If the input sequence {xt} is properly selected such that for 
any P„ G P„ and qn-i G Qn-i, the inequality ( |18[ ) holds, 
then with Lemma we have the adaptive alternating algorithm 
for the linear regression problem converges. 


Another technique we may explore to tackle the non-convex 
optimization problem in ( [T3| ) is the efficient discretization. 
The idea is that instead of using alternating minimization, for 
the non-convex optimization problem involved, we can find 
a sufficiently fine yet efficient discretization (as function of 
the desired error guarantee) that allows us to replace a single 
non-convex optimization problem by a polynomial number of 
convex problems. This will not lead to practically efficient 
algorithms as everything beyond quadratic or cubic running 
time in the size is usually prohibitive. However, it will allow 
us to establish general, theoretical guarantees and guide the 
search for better practical algorithms. In particular, we can 
adopt approaches similar to the e-net approach in |27| . This 
provides a discrete set S of size IIS'! < (l/e)'^ so that for all 
y G K", there exists a point yo G S such that || Ay —j/olioo < £■ 
This approximation now allows us to handle bilinear terms of 
the form x'^ Ay, which are non-convex, by replacing them with 
the two terms x'^yo and ||Ay — yolloo < e and iterating over 
all possible choices yo G S. With the help of the following 
lemma, we may establish our result. 


Lemma 3 (e-net for positive semidefinite matrix |27|). Let 
A = BB'^, where A is an D x D positive semidefinite matrix 
with entries in [—1,1] and B is D x d. Let A = A„ = 
{x G K^,||a:||i = l,x > 0}. There is a finite set S G 
independent of A, B such that 

Vx G A, 3x G 5 s.t. ||i7'''x — xjjoo < ^ 

with |5| = C7((l/e)‘^). Moreover, S can be computed in time 
0{{lleYpo\y{D)). 


Assume ||x||i = a > 0. From lemmawe can compute 
such a set S in time 0{{a/eYpolyi^D)) such that 3x G 5 
and ||[/■'■x — xjjoo < £/d. Let S = {xi,... .X|5|} where |5| = 
C7((a/e)‘^). We can then approximate a related problem to 
(12i by a family of |5| problems: 


min E||C'q(x—( 7xi 


1^, 


i = 1, 


J5|. (19) 


By combining all sub problems in ( [T9l l together, we have the 
following equivalent problem 


min E|1CT(A5-[/A5)|| 




( 20 ) 


where 


A5= [a 


| 5 | 

and Xs = [xi ... i| 5 |]- Note that the set S depends on 

X, and by construction Cq N orthogonal, we approximate ( [20] ) 
by 

mm||Cj(X5 - UXs)\\% = mmjlAs - UXsWf, (21) 

which is convex. 
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Lemma 4. An e-net approximation to 0 can be computed 
in polynomial time 0((a/e)^'^ • poly(I?d). 

Proof. The complexity of this epsilon-net method is the time 
for computing S plus the time for computing the optimization 
group generated by epsilon-net. From lemma computing 
S takes time 0((a/e)‘^poly(Z?)) and the size of the set is 
|iS| = 0{{a/eY). The dimension of the matrix in this problem 
is Z? X |5|. With the result in |28) , we may use an iterative 
iterative algorithm to solve the problem and the computational 
complexity in each iteration is 0{D‘^\S\ -f |iS|^). This leads to 
a polynomial time algorithm with complexity 0{poly{Dd)). 
Hence, the total time needed for generating and solving the 
e-net approximation problems is 

0{{a/efpo\y{D) + 0{{a/ef‘^) ■ 0{poly{Dd)) 
=0{{a/ef^-poly{Dd). 

□ 


V. Numerical examples 

We demonstrate the performance of OSDR compared with 
the online dimensionality reduction (ODR) (learning a subspace 
online by minimizing ||a; — t//Z|| without using information of 
y) via numerical examples on simulated data and real data. 
We start with a simple numerical example, followed by a 
larger scale simulation, an example to illustrate the union-of- 
subspaces and random dot product model, and finally real-data 
experiments. 


A. Simple subspace tracking 

Static subspace, logistic regression. We first generate data 
by embedding a static low-dimensional space of intrinsic 
dimension d* = 2 into a high dimensional space with 
dimension D, and generating a sequence of fit vector such that 
the entries of (3t are i.i.d. Af{0, 1) and lies within an ellipse: 

/3li/rl + l3yrl<l. ( 22 ) 

The predictor vector xt € is formed as xt = UPt Wt, 
where Wt is a Gaussian noise with zero mean and variance 
equal to 10“^. The logistic regression coefficient vector is 
along the shorter axis of the ellipse. Among the 6000 samples 
generated this way, the first 3000 samples are for training, and 
the remaining 3000 samples are for testing. Given xt, ODR 
or OSDR predict the label yt, then we reveal the true label 
to calculate error, and then use {xt,yt) to update. We use 
misclassification error Pe on the test data as our performance 
metric. In Fig. OSDR outperforms ODR by an almost two 
order of magnitude smaller error. 
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(a) 


(b) 


Fig. 4: Tracking a static subspace with D — 100 and true intrinsic 
dimension d* = 2. We plot the dimension of the subspace d versus 
misclassification errors for different aspect ratio of the ellipse in 
(a) ri/r 2 = 3; (b) ri/r 2 = 5; 


of OSDR to handle dynamic data. Assume Ut = UoR{t) with 
the rotation matrix given by 


R{t) 


COs(Q;t) — sin(Q;t) 
sin(at) cos(at) ’ 


(23) 


where at is the rotation angle, and Uq G is a random 

initial subspace. The vector jdt is again generated with entries 
i.i.d. and lies within an ellipse described by ( p2) l. The predictors 
Xt is generated as the last example. The rotation angle at 
follows 


Q!i = 



t-500 

6000-500’ 


if f < 500; 
if 500 < f < 6000; 


(24) 


where r is the rotation speed (smaller r corresponds to faster 
rotation). The logistic regression coefficient vector is along the 
shorter axis of the ellipse. Fig. shows Pf. for various ration 
speed, where, again OSDR significantly outperforms ODR. 



Rotating subspaces, logistic regression. Next we consider 
tracking a time-varying subspace to demonstrate the capability 
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Fig. 5: Rotating subspace with aspect ratio of the ellipse ri/r 2 = 10. 
Rotation ratio r versus misclassification rate when (a) d = 30; (b) d 
= 10; (c) d = 5; (d) d = 2. 






































Static subspace, linear regression. The third simple example 
compares the performance of OSDR with ODR in the linear 
regression setting. The setup is similar to that of tracking a 
static subspace, except that the response variable yt is generated 
through a linear regression model yt = c'^U/3 + b + et, with 
D = 2, d = I, c = [ci,C 2 ]^ and ~ A/'(0,(5^) with S'^ = 
10“^. We use the rooted mean squared error (RMSE) on the 
test data as our performance metric, which is the square root of 
the averaged square error on the predicted y differs from true y. 
Fig- HI shows the RMSE associated with OSDR is significantly 
lower than that of ODR. 



Fig. 6: Static subspace, OSDR linear regression: RMSE versus 
log(ci/c 2 ) versus log(RMSE) when (a) r\/r 2 = 1; (b) ri/r 2 = 2. 



dimension configurations 


(a) Type I Results 


dimension configurations 


(b) Type II Results 


Fig. 7: Synthetic classification dataset, (a) test errors Pe under type 
I; (b) test errors Pe under type II. The ten (D, d) pairs are (2, 1); (3, 
2); (10, 2); (10, 4); (10, 6); (10, 8); (50, 10); (50, 20); (50, 30); (50, 
40), and have x-labels from 1 to 10 in the figure, correspondingly. 


B. Synthetic classification dataset 

We consider a larger scale example by generating a synthetic 
dataset for two-class classification. The dataset has N = 10000 
samples {a;„} G , n = 1,... ,N. This generating process 
guarantees that the eigenvalues of the covariance matrix for the 
dataset are picked from D, D — 1,... ,1 (This is done by first 
generate a random subspace, then project the random generated 
data into this subspace, and also scale the dimensions to have 
the required eigenvalues). An example when D = 3 are shown 
in Fig. 1^ 

The labels of the data are obtained as follows: Suppose 
the covariance matrix of data is X. We first find the eigen¬ 
values [Ai, A 2 ,..., Ad] and the corresponding eigenvectors 
[til, t; 2 , •. ■, "Ud] where Ai > A 2 > ... > Ad; then select an 
eigenvector Vp,p G [1, •. •, D], and project the data onto this 
vector. After sorting the projected values, we label the first half 


as positive, and last half as negative. Consider two settings 
for selecting p: type-I: p = d+1', type-II: p = d + {D — d)/2. 
Clearly, type-II will be harder, since the corresponding variance 
of projected data will be smaller. 

We use half of the data for training, and another half for 
testing. The tuned learning rate p G [10“^, 10“^, 10“"^] and 
T] G [10-2,10-3,10-4] for both ODR and OSDR. The mean 
accuracy Pe are reported after 10 rounds of experiments for 
each setting. Besides different types of labelling directions, we 
also evaluated different combinations of {D, d) pairs, as shown 
in the Fig. The results show that OSDR outperforms the SDR 
(baseline) significantly. This is expected, as the first d leading 
directions of principle components contain nothing about the 
label information; therefore, the unsupervised ODR will fail 
almost surely. The simple example shown in Fig. [^proves that 
OSDR can identify the correct direction for projecting data. 

C. Union-of-subspaces combined with random dot product 
model 

We generate an example for interaction of two nodes with 
features fii and (32 through a random dot product graph model 
defined over al union-of-subspaces, as illustrated in Fig. 
There are three sets which correspond to the three leaf nodes 
in the tree. Each node in the tree is associated with a subset 
lies on a subspace and also a logistic regression coefficient 
vector. At each time, two predictor vectors and X 2 ,t 
of 100 dimensions are observed (which may belong to the 
same subset or different subsets) and their interaction yt is 
generated through logistic regression model that depending 
on their inner product. In this example, we also assume there 
are missing data: only 40% of the samples are observed 
and the variance of the noise is 0.01. The subsets in the 
tree are also time varying. The subspace associated with 
the root node is a random orthogonal matrix that rotates: 
(7i,t = exp(i?(:) with R being a random per-symmetric 
matrix that determines the rotation. The children nodes of 
the root node are slight rotation of the subspace of their 
parent node: U 2 ,i,t = exp(i?)[/i,i,t, (72, 2 ,t = exp(-i?)(7i,i,t, 
= exp(i?/2)(72,i,t, ( 73 , 2 ,t = exp(-i?/2)(72,i,t. Results 
in Table shows that OSDR outperforms the conventional 
online logistic regression (which does not perform dimension 
reduction and ignores the tree structure). 

TABLE I: Comparison of Pe for data generated from a union-of- 
subspaces combined with random dot product model. 


Online 

logistic regression 

Hierarchical 

OSDR 

0.2133 

0.1440 


D. Real-data experiments 

USPS dataset. We test OSDR logistic regression on the well- 
known USPS dataset of handwritten digits. The dataset contains 
a training set with 7291 samples, and a test set with 2007 
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Fig. 8: USPS digits recognition. Dimension of the subspace d versus 
misclassification rate for OSDR and ODR, respectively. 

samples. The digits are downscaled to 16 x 16 pixels. To 
demonstrate a online setting, we read one picture at a time 
with the task of classifying whether a digit is “2” or not. Again, 
we hrst use vectored image as xt, predict label yt, and then 
reveal the true label followed by using (xt, yt) to update the 
logistic regression model. Fig. [^demonstrates OSDR compared 
favorably with ODR for this task. The improvement is not 
significant; which may be due to the fact that in this case 
the dominant eigenvectors captures most useful features for 
classihcation already. 

“Boy or girl” classification. We next perform a “boy or girl” 
classihcation task on an image set we collected from college 
students enrolled in a class. This dataset contains 179 grey 
scaled images, where 116 of them are from boys. Each picture 
has the resolution of 65x65, thus the original space has a 
dimension D = 4225. We train both algorithms from raw 
images. For the parameter search, we tune the learning rate 
n e [10-2,10-3,10-4] and p e [10-2,10-3,10-4] for both 
algorithms. Experiments are carried out with different settings 
of d G [10, 50,100,150]. The mean test error of 5-fold cross 
validation on this dataset are reported in Table [^ for each 
conhguration. 

TABLE II: Average test error after 5-fold cross validation on Boy- 
vs-Girl dataset. 


Method 

o 

1—1 

II 

d = 50 

d= 100 

d= 150 

ODR 

0.4800 

0.4629 

0.4517 

0.4229 

OSDR 

0.3314 

0.2343 

0.2171 

0.1714 


For this dataset, the OSDR algorithm signihcantly out¬ 
performs ODR. To gain a better understanding for its good 
performance, we examine the subspace generated from the 
experiment. We hrst visualize the top 7 vectors in the basis 
of subspace U for OSDR and ODR, respectively. We reshape 
each vector into a image and, hence, this displays the so- 
called “eigenface”. Note that the eigenfaces generated by the 
unsupervised (corresponding to online PCA) and the supervised 
subspace learning are very different. The online PCA keeps 
some facial details, while the OSDR algorithm is getting some 
vectors that are hard to explain visually, but may actually 
captured details that are more important for telling apart boys 
and girls. 



Fig. 9: Visualization of top 7 basis in subspace U : “eigenfaces”. 
Images are processed via blurring and contrast enhancement for 
better visualization. The hrst row corresponds to the basis obtained 
from ODR (baseline) algorithm, while the second row consists of 
basis from the OSDR algorithm. 



(a) baseline boys (b) baseline girls (c) OSDR boys (d) OSDR girls 


Fig. 10: Mean reconstructed image for boys and girls, (a) and (b) are 
showing results obtained from ODR algorithm, while (c) and (d) are 
showing results created by OSDR. Images are processed via blurring 
and contrast enhancement for better visualization. 


We further examine the average image of reconstmcted faces 
x = (7/3 in Fig. We compute the average reconstructed 
image of boys and girls separately, so as to evaluate the 
discriminative ability of both algorithms. We can see the 
unsupervised subspace tracking (online PCA) obtained many 
details of the facial attributes, and the reconstructions of 
boys and girls have little differences, which makes it hard to 
distinguish the two genders. So in this case, the unsupervised 
algorithm fails because of the lack of supervised information. 
In contrast, the supervised OSDR extracts two very different 
“average” faces for the boy and the girl, respectively, although 
these average faces do not directly reflect any facial detail. 
Interestingly, from the contour we learned that, the most 
discriminative attribute learned by OSDR is the hair (see the 
dark part in Fig. 10 (d) around shoulder of the girls). This is 
a straightforward and efficient feature for distinguishing boys 
and girls. This example clearly demonstrate that OSDR focus 
on extracting the component that differentiate two classes, 
and hence, is quite suitable for dimensionality reduction in 
classihcation problems. 
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